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ABSTRACT 

Recent developments in optimization theory have extended some traditional al¬ 
gorithms for least-squares optimization of real-valued functions (Gauss-Newton, 
Levenberg-Marquardt, etc.) into the domain of complex functions of a complex vari¬ 
able. This employs a formalism called the Wirtinger derivative, and derives a full- 
complex Jacobian counterpart to the conventional real Jacobian. We apply these de¬ 
velopments to the problem of radio interferometric gain calibration, and show how the 
general complex Jacobian formalism, when combined with conventional optimization 
approaches, yields a whole new family of calibration algorithms, including those for 
the polarized and direction-dependent gain regime. We further extend the Wirtinger 
calculus to an operator-based matrix calculus for describing the polarized calibration 
regime. Using approximate matrix inversion results in computationally efficient im¬ 
plementations; we show that some recently proposed calibration algorithms such as 
StefCal and peeling can be understood as special cases of this, and place them in 
the context of the general formalism. Finally, we present an implementation and some 
applied results of Coh Jones, another specialized direction-dependent calibration al¬ 
gorithm derived from the formalism. 

Key words: Instrumentation: interferometers, Methods: analytical, Methods: nu¬ 
merical, Techniques: interferometric 


INTRODUCTION 

In radio interferometry, gain calibration consists of solving 
for the unknown complex antenna gains, using a known 
(prior, or iteratively constructed) model of the sky. Tra¬ 
ditional (second generation, or 2GC) calibration employs 
an instrumental model with a single direction-independent 
(DI) gain term (which can be a scalar complex gain, or 
2x2 complex-valued Jones matrix) per antenna, per some 
time/frequency interval. Third-generation (3GC) calibration 
also addresses direction-dependent (DD) effects, which can 
be represented by independently solvable DD gain terms, 
or by some parameterized instrumental model (e.g. pri¬ 
mary beams, pointing offsets, ionospheric screens). Dif¬ 
ferent approaches to this have been proposed and imple¬ 
mented, mostly in the framework of t he r ad io interferome ¬ 
try measuremen t equation (RIME, see lHamaker et abll 19961 ): 
I Smirnov! (l2011al lblla) provides a recent overview. In this work 
we will restrict ourselves specifically to calibration of the DI 
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and DD gains terms (the latter in the sense of being solved 
independently per direction). 

Gain calibration is a non-linear least squares (NLLS) 
problem, since the noise on observed visibilities is al¬ 
most always Gaussian (t h ough other trea tments have been 
proposed by iKazemi fe Yatawattal l2013l h Traditional ap¬ 
proaches to NLLS problems inv olve various gradient -based 
techniques (for an overview, see iMadsen et alj 120041 ). such 
as Gauss-Newton (GN) and Levenberg-Marquardt (LM). 
These have been restricted to functions of real variables, 
since complex differentiation can be defined in only a very 
restricted sense (in particular, dz/dz does not exist in the 
usual definition). Gains in radio interferometry are complex 
variables: the traditional way out of this conundrum has 
been to recast the complex NLLS problem as a real prob¬ 
lem by treating the real and imaginary parts of the gains as 
independent real variables. 

Recent developments in optimiza tion theory 
dKreutz-Delgadol 120091 : lLaurent et all l2012l l have shown 
that using a formalism c alled the Wirtinger complex 
derivative dWirtingei! Il927l l allows for a mathematically 
robust definition of a complex gradient operator. This leads 
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to the construction of a complex Jacobian J, which in 
turn allows for traditional NLLS algorithms to be directly 
applied to the complex variable case. We summarize these 
developments and introduce basic notatio n in Sect. [T] 
In Sect. [2j we follow on from iTassel d2014h to apply this 
theory to the RIME, and derive complex Jacobians for 
(unpolarized) DI and DD gain calibration. 

In principle, the use of Wirtinger calculus and complex 
Jacobians ultimately results in the same system of LS equa¬ 
tions as the real/imaginary approach. It does offer two im¬ 
portant advantages: (i) equations with complex variables are 
more compact, and are more natural to derive and analyze 
than their real/imaginary counterparts, and (ii) the struc¬ 
ture of the complex Jacobian can yield new and valuable 
insights into the problem. This is graphically illustrated in 
Fig. □ (in fact, this figure may be considered the central 
insight of this paper). Methods such as GN and LM hinge 
around a large matrix J H J - with dimensions correspond¬ 
ing to the number of free parameters; construction and/or 
inversion of this matrix is often the dominant algorithmic 
cost. If J H J can be treated as (perhaps approximately) 
sparse, these costs can be reduced, often drastically. FigureQ] 
shows the structure of an example J H J matrix for a DD gain 
calibration problem. The left column row shows versions of 
J H J constructed via the real/imaginary approach, for four 
different orderings of the solvable parameters. None of the 
orderings yield a matrix that is particularly sparse or easily 
invertible. The right column shows a complex J H J for the 
same orderings. Panel (f) reveals sparsity that is not appar¬ 
ent in the real/imaginary approach. This sparsity forms the 
basis of a new fast DD calibration algorithm discussed later 
in the paper. 

In Sect. [31 we show that different algorithms may be de¬ 
rived by combining different sparse approximations to J H J 
with conventional GN and LM methods. In particular, we 
show that S tefCal, a fast D I calib ration algorithm recently 
proposed by ISalvini fe Wiinholdj iteOMal h can be straight¬ 
forwardly derived from a diagonal approximation to a com¬ 
plex J H J. We show that the complex Jacobian approach 
naturally extends to the DD case, and that other sparse 
approximations yield a whole family of DD calibration algo¬ 
rithms with di fferent sc aling properties. One such algorithm, 
CohJones (lTassdl2014T ) . has been implemented and success¬ 
fully applied to simulated LOFAR data: this is discussed in 
Sect. [2] 

In Sect. [J] we extend this approach to the fully polar¬ 
ized case, by developing a Wirtinger-like operator calculus 
in which the polarization problem can be formulated suc¬ 
cinctly. This naturally yields fully polarized counterparts to 
the calibration algorithms defined previously. In Sect. [5j we 
discuss other algorithmic variations, and make co nnections 
to old er DD calibration techniques such as peeling (iNoordaml 

12004 ) . 

While the scope of this work is restricted to LS solutions 
to the DI and DD gain calibration problem, the potential ap¬ 
plicability of complex optimization to radio interferometry 
is perhaps broader. We will return to this in the conclusions. 
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order of operations is Xjj = X^ Y = (Xu) y , 
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data, model, residuals, gains, sky coherency 
value associated with iteration k 
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value associated with direction d 
matrix of weights 

partial, conjugate partial Jacobian at iteration k 

full complex Jacobian 

J H J and its approximation 

vectorization operator 

right-multiply by A operator (Sect. n 

left-multiply by A operator (Sect. 0 

Kronecker delta symbol 


matrix blocks 


repeated matrix block 


1 WIRTINGER CALCULUS & COMPLEX 
LEAST-SQUARES 


The traditional approach to optimizing a function of n com¬ 
plex variables f(z), z £ C™ is to treat the real and imaginary 
parts z = x + iy independently, turning / into a function 
of 2 n real variables f(x,y), and the problem into an opti- 

mization over R 2n . _ _ _ 

iKreutz-Delgadol fl2009l ) and lLaurent et alJ (120121 ') pro- 
pose an a lterna tive approach to the problem based on 
IWirtingerl (|l927l l calculus. The central idea of Wirtinger cal¬ 
culus is to treat z and z as independent variables, and op¬ 
timize /(z,z) using the Wirtinger derivatives 


d__l fd_ _ ,d_\ d_ _ 1 / d_ ,d_\ 

dz 2 \ dx 1 dyJ dz 2 \dx + 1 dyJ ’ 

where z = x + iy. It is easy to see that 


dz 

Wz 



( 1 . 1 ) 


( 1 . 2 ) 


i.e. that z (z) is treated as constant when taking the deriva¬ 
tive with respect to z (z). From this it is straightforward to 
define the complex gradient operator 
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from which definitions of the complex Jacobian and complex 
Hessians naturally follow. The authors then show that vari¬ 
ous optimization techniques developed for real functions can 
be reformulated using complex Jacobians and Hessians, and 
applied to the complex optimization problem. In particu¬ 
lar, they generalize the Gauss-Newton (GN) and Levenberg- 
Marquardt (LM) methods for solving the non-linear least 
squares (NLLS) problenf] 

min \\r(z, z)\\ F , or min ||d - v(z, z)\\ F (1.4) 

Z Z 

where r,d,v have values in C m , and || ■ ||_f is the Frobenius 
norm. The latter form refers to LS fitting of the parame¬ 
terized model v to observed data d, and is the preferred 
formulation in the context of radio interferometry. 

Complex NLLS is implemented as follows. Let us for¬ 
mally treat z and z as independent variables, define an aug¬ 
mented parameter vector containing both, 


« = 


z 

z 


(1.5) 


and designate its value at step k by z k - Then, define 


J k 


dv 


(z k ), 


J k* 


dv 

ai 


(Zfc), 


Vk = 


r(z k ) 

r{zk) 


( 1 . 6 ) 


We’ll call the m x n matrices Jk and Jk * the partial and 
partial conjugate Jacobiar^ respectively, and the 2m -vector 
rk the augmented residual vector. The complex Jacobian of 
the model v can then be written (in block matrix form) as 


J = 


J k 

J k* 


J k* 

J k 


(1.7) 


with the bottom two blocks being element-by-element con¬ 
jugated versions of the top two. Note the use of J to indi¬ 
cate element-by-element conjugation - not to be confused 
with the Hermitian conjugate which we’ll invoke later. J is 
a 2m x 2 n matrix. The GN update step is defined as 


Sz = 


5z 

Sz 


/ tH t \ — 1 t H ~ 

(J J) J Vki 


( 1 . 8 ) 


The LM approach is similar, but introduces a damping 
parameter A: 

= {J H J + \D)~ 1 2 J H r k , (1.9) 



where D is the diagonalized version of J H J. With A = 0 this 
becomes equivalent to GN, with A —> oo this corresponds to 
steepest descent (SD) with ever smaller steps. 

Note that while Sz and Sz are formally computed inde¬ 
pendently, the structure of the equations is symmetric (since 
the function being minimized - the Frobenius norm - is real 
and symmetric w.r.t. z and z), which ensures that Sz = <5z. 
In practice this redundancy usually means that only half the 
calculations need to be performed. 


lLaurent et al.l J2012f l show that Eqs. 11.81 and 11.91 yield 
exactly the same system of LS equations as would have been 
produced had we treated r(z) as a function of real and imag¬ 
inary parts r(x,y), and taken ordinary derivatives in R 2n . 
However, the complex Jacobian may be easier and more el¬ 
egant to derive analytically, as we’ll see below in the case of 
radio interferometric calibration. 


2 SCALAR (UNPOLARIZED) CALIBRATION 

In this section we will apply the formalism above to the 
scalar case, i.e. that of unpolarized calibration. This will 
then be extended to the fully polarized case in Sect. [4] 


2.1 Direction-independent calibration 


Let us first explore the simplest case of direction- 
independent (DI) calibration. Consider an interferometer ar¬ 
ray of iV an t antennas measuring IVbi = IV an t (IV an t —1)/2 pair¬ 
wise visibilities. Each antenna pair pq (1 ^ p < q ^ IV ant ) 
measures the visibility^ 


9v m vq9<i J" n pq > 


( 2 . 1 ) 


where m pq is the (assumed known) sky coherency, g p is the 
(unknown) complex gain parameter associated with antenna 
p, and n vq is a complex noise term that is Gaussian with a 
mean of 0 in the real and imaginary parts. The calibration 
problem then consists of estimating the complex antenna 
gains g by minimizing residuals in the LS sense: 

min V'' \r pq \ , r pq = d pq — g P m Pq g q , (2-2) 
9 i —' 
pi 

where d pq are the observed visibilities. Treating this as a 
complex optimization problem as per the above, let us write 
out the complex Jacobian. With a vector of IV an t complex 
parameters g and IVbi measurements d pq , we’ll have a full 
complex Jacobian of shape 2IVbi x 2IV an t. It is conventional 
to think of visibilities laid out in a visibility matrix; the nor¬ 
mal approach at this stage is to vectorize d pq by fixing a 
numbering convention so as to enumerate all the possible 
antenna pairs pq (p < q) using numbers from 1 to IVbi. In¬ 
stead, let us keep using pq as a single “compound index”, 
with the implicit understanding that pq in subscript corre¬ 
sponds to a single index from 1 to IVbi using some fixed 
enumeration convention. Where necessary, we’ll write pq in 
square brackets (e.g. ar pi3 i j) to emphasize this. 

Now consider the corresponding partial Jacobian Jk 
matrix (Eq. 11.611 . This is of shape IVbi x IV an t. Using the 
Wirtinger derivative, we can write the partial Jacobian in 
terms of its value at row \pq\ and column j as 


[J 


kJW.i 


m pq g q , j=p, 

0, otherwise. 


(2.3) 


1 It should be stressed that Wirtinger calculus can be applied to 
a broader range of opt imization problems than just LS. 

2 lLaurent et al.l d2012h define the Jacobian via dr rather than 
dv. This yields a Jacobian of the opposite sign, and introduces a 
minus sign into Eqs. |l.8l and |1.9l In this paper we use the dv con¬ 
vention, as is more common in the context of radio interferometric 
calibration. 


3 In principle, the autocorrelation terms pp, corresponding to the 
total power in the field, are also measured, and may be incorpo¬ 
rated into the equations here. It is, however, common practice to 
omit autocorrelations from the interferometric calibration prob¬ 
lem due to their much higher noise, as well as technical difficulties 
in modeling the total intensity contribution. The derivations be¬ 
low are equally valid for p ^ q, we use p < q for consistency with 
practice. 
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In other words, within each column j, Jk is only non¬ 
zero at rows corresponding to baselines [jq]. We can express 
this more compactly using the Kronecker delta: 

j = 1... A/ant 

Jk = [m pq g q 5 J p \}[ P q]=i...N bl ( P <q ) (2-4) 

Likewise, the conjugate partial Jacobian Jk * may be written 
as 


j — 1 • •. iVant 

/ A \ 

Jk * = [ffpnV 3 <^]}[pg]= i...JV b i (p<q) (2.5) 


A specific example is provided in Appendix m The full com¬ 
plex Jacobian (Eq. 11.711 then becomes, in block matrix no¬ 
tation, 


j — 1---N an t j — 1... iVant 


J 


^pqdqfip gpmpqPq 

^TpqCjpSq dq^pq^p 


}[ P9 ]=i...jv bl (p < q) 

}M=i...JV bl (p < q ) 


( 2 . 6 ) 

where the [ pq] (p < q) and j subscripts within each block 
span the full range of 1... IVbi and 1... ./V an t. Now, since 
d pq = d q p and m pq = fh qp , we may notice that the bottom 
half of the augumented residuals vector r corresponds to the 
conjugate baselines qp (q > p): 


r pq 


dpq 

g P mpqg q 


Tpq 

}[p(j]=l...JV b l (p<q) 

fpq 


dpq 

— g p mpqg q 


Tip. 

}[p(j]=l...AT b i (p<q) 


(2.7) 

as does the bottom half of J in Eq. [S3 Note that we are free 
to reorder the rows of J and r and intermix the normal and 
conjugate baselines, as this will not affect the LS equations 
derived at Eq. [i~9l This proves most convenient: instead of 
splitting J and f into two vertical blocks with the compound 
index [pq] (p < q) running through IVbi rows within each 
block, we can treat the two blocks as one, with a single 
compound index \pq\ (p ^ q) running through 2A r b i rows: 

j = 1 ■ • • AT an t j = l...N an t 

J = [ m pq g q 8p g P mpq8 3 q ], r = [r pq \ }M=i... 2 jv bl 

( 2 . 8 ) 

where for q > p, r pq = r qp and m pq = m qp . For clar¬ 
ity, we may adopt the following order for enumerating the 
row index [pq]: 12,13,..., In, 21, 22,..., 2n, 31, 32,..., 3n, 
..., nl,..., nn — 1. 

Equation I A2I in the Appendix provides an example of J 
for the 3-antenna case. For brevity, let us define the short¬ 
hand 


Vpq — m pq g q . 


(2.9) 


since the value at row i, column j of each block is 


Aij — 'y ' Vpq Vpq 8p S J p 
pq 

JJij = "y ' VpqVqpJpJq 

pq 

Cij = y ] yqpVpqfiqfip 

pq 

JJii = y ' ypqypqJfiq 

pq 


f J2 *=j 
l 0, 

f ViiVji, 

\ 0, i=j 

Bij 


Aij 


( 2 . 11 ) 


We then write J H J in terms of the four Aant x Aant 
blocks as: 


J H J = 


diag E \Vig\ 

q^i 


jl/ijUii, 

\ 0, i=j 


(yijya, &} 

X 0, i=j 


diag E \Viq\ 

q^i 


( 2 . 12 ) 


Equation I A3I in the Appendix provides an example of J H J 
for the 3-antenna case. 

The other component of the LM/GN equations 
(Eq. 11.91) is the J H r term. This will be a column vector 
of length 2Aant. We can write this as a stack of two lV an t- 
vectors: 


t h - 

J r = 


S Vp<i r p<i^p 

pq 


E Viq r iq 

q^i 

S yQP r PQ^q 

-PQ 


E Viqriq 

-q^i 


— l-.-JVant 
} I - ■ -^ant 


(2.13) 


with the second equality established by swapping p and q in 
the bottom sum, and making use of r pq = f qp . Clearly, the 
bottom half of the vector is the conjugate of the top: 


Ci — y ] PiqViq. 

qjii 


(2.14) 


2.2 Computing the parameter update 


Due to the structure of the RIME, we have a particularly 
elegant way of computing the GN update step. By analogy 
with the augmented residuals vector r, we can express the 
data and model visibilities as 2A^bi-vectors, using the com¬ 
pound index [pq] (p ^ q): 

d=[d pq ], v = [g p mp q g q ], r = d — v (2-15) 

As noted bv lTassel (I2014T) . we have the wonderful prop¬ 
erty that 


v = J L g = -Jg, where g 


9 

9 


(2.16) 


(where Xl designates the left half of matrix X - see Table[T] 
for a summary of notation), which basically comes about 
due to the RIME being bilinear with respect to g and g. 
Substituting this into the GN update step, and noting that 
X(Y l ) = (XY) L , we have 


We can now write out the structure of J H J. This is Hermi- 
tian, consisting of four Aant X AI an t blocks: 


A 

B 


A 

B 

C 

D 


B" 

A 


( 2 . 10 ) 


Sg 

5 g_ 


(, J H J)~ 1 J H {d-Ji J g ) = (J H J)~ 1 J H d~g. (2.17) 


Consequently, the updated gain values at each iteration can 
be derived directly from the data, thus obviating the need 
for computing residuals. Additionally, since the bottom half 
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of the equations is simply the conjugate of the top, we only 
need to evaluate the top half: 

g k+1 = g k + 5g = (J H J)- 1 v J H d, (2.18) 

where X\j designates the upper half of matrix X. 

The derivation above assumes an exact inversion of 
H = J H J. In practice, this large matrix can be costly to 
invert, so the algorithms below will substitute it with some 
cheaper-to-invert approximation H. Using the approximate 
matrix in the GN update equation, we find instead that 


2.4 Weighting 

Although lLaurent et al.l ( 20121 ) do not mention this explic¬ 
itly, it is straightforward to incorporate weights into the 
complex LS problem. Equation [T3] is reformulated as 

min \\Wr(z, z)||f, (2.24) 

Z 

where W is an M x M weights matrix (usually, the inverse 
of the data covariance matrix C). This then propagates into 
the LM equations as 


9k+i = H \j H d + (I H \jif L )fl fc) (2.19) 

which means that when an approximate J H J is in use, the 
shortcut of Eq. 12.181 only applies when 

H \H h = I. (2.20) 

We will see examples of both conditions below, so it is 
worth stressing the difference: Eq. l2.18l allows us to compute 
updated solutions directly from the data vector, bypassing 
the residuals. This is a substantial computational shortcut, 
however, when an approximate inverse for H is in use, it 
does not necessarily apply (or at least is not exact). Under 
the condition of Eg. 12.201 however, such a shortcut is exact. 


Sz = ( J H WJ + XI)- 1 J H Wr k . (2.25) 


Adding weights to Eqs. 12.221 and 12.231 we arrive at the 
following: 



diag X w iq3 \y 2 iqs \ 



qjti,s 


J H WJ = 

[ S W ijs VijsDjis , 



s 

\ 


_ [ 0 , i—'j 



(2.26) 


J H Wr 


yi IViqsyiqs'Tiqs 

q^i,s 


i H 


(2.27) 


2.3 Time/frequency solution intervals 

A common use case (especially in low-SNR scenarios) is to 
employ larger solution intervals. That is, we measure mul¬ 
tiple visibilities per each baseline pq, across an interval of 
timeslots and frequency channels, then obtain complex gain 
solutions that are constant across each interval. The mini¬ 
mization problem of Eg. I2.1l can then be re-written as 

min \ lu»qs| , 'Tpqs — dpqs (2-21) 

9 i —' 
pqs 

where s = 1 ,...,N S is a sample index enumerating all the 
samples within the time/frequency solution interval. We can 
repeat the derivations above using [pgs] as a single com¬ 
pound index. Instead of having shape 2Abi x 2A an t, the Ja¬ 
cobian will have a shape of 2AbiA s x 2A r an t, and the residual 
vector will have a length of 2AbiA s . In deriving the J H J 
term, the sums in Eo. l2.10l must be taken over all pqs rather 
than just pq. Defining the usual shorthand of y pqs = m pqs g q , 
we then have: 


J H J = 


diag J2 \Viqs\ 
q^i,s 


X yijsVji 

s 

0, 




( 2 . 22 ) 


where the symbols \ and /* H represent a copy and a copy- 
transpose of the appropriate matrix block (as per the struc¬ 
ture of Eq. 12.1011 . Likewise, the J H r term can be written 
as: 


t h ~ 

J r = 


X yiqsTiqs 
q^i,s 




(2.23) 


2.5 Direction-dependent calibration 

Let us apply the same formalism to the direction-dependent 
(DD) calibration problem. We reformulate the sky model 
as a sum of Adi r sky components, each with its own DD 
gain. It has been common practice to do DD gain solu¬ 
tions on larger time/frequency intervals than DI solutions, 
both for SNR reasons, and because short intervals lead to 
under-constrained solutions and suppression of unmodeled 
sources. We therefore incorporate solution intervals into the 
equations from the beginning. The minimization problem 
becomes: 

JVdlr 

mj n ^|r P q S | 2 , r pqa = d P qsgp d) m ( p d f a g ( q d) . (2.28) 

pqs d= 1 

It’s obvious that the Jacobian corresponding to this problem 
is very similar to the one in Eq. 12.81 but instead of having 
shape 2Abi x 2A an t, this will have a shape of 2AbiA s x 
2A an t Adir- We now treat [pqs] and [jd] as compound indices: 


j — 1"*-Nant j — l...iV an t 

d=l...iV dir d=l...JV dir 

1 [pq]= i—2N h i (p^q) 

J s=l...N a 

(2.29) 

Every antenna j and direction d will correspond to a column 
in J, but the specific order of the columns (corresponding 
to the order in which we place the g^ elements in the aug¬ 
mented parameter vector g ) is completely up to us. 

Consider now the J H J product. This will consist of 2 x 2 
blocks, each of shape [A an t Adi r ] 2 . Let’s use i, c to designate 
the rows within each block, j, d to designate the columns, 
and define = rriffg^. The J H J matrix will then have 


J = 


m ( p d J a g ( q d) S 3 p 


(d) (d) 

g[, m - 


pqs^q 
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the following block structure: 


J H J = 


A 

B 


B h 

A 


5) E v\ c Mt 


(c) ( d) 

E VjisVijs^ 

0 


, (2.30) 


while the J H r term will be a vector of length 2A r an t A+H r , 
with the bottom half again being a conjugate of the top 
half. Within each half, we can write out the element corre¬ 
sponding to antenna j, direction d: 


t h ~ 

J r = 




(d) 

'jqs r 3<l s ’ 


(2.31) 




Finally, let us note that the property of Eq. 12.161 also 
holds for the DD case. It is easy to see that 


JVdi, 


J2ai d) mi d U d) 


= Jhg 


(2.32) 


Consequently, the shortcut of Eq. 12.181 also applies. 


3 INVERTING J H J AND SEPARABILITY 

In principle, implementing one of the flavours of calibra¬ 
tion above is “just” a matter of plugging Eqs. I2.12I + I3TJ1 
I2.22I + IT23112.261 + 12371 or 12.301 + 12311 into one the algorithms 
defined in Appendix 0 Note, however, that both the GN 
and LM algorithms hinge around inverting a large matrix. 
This will have a size of 2N an t or 2N an t IVan squared, for the 
DI or DD case respectively. With a naive implementation 
of matrix inversion, which scales cubically, algorithmic costs 
become dominated by the 0(N ant ) or 0(N ant N^ ir ) cost of 
inversion. 

In this section we investigate approaches to simplify¬ 
ing the inversion problem by approximating H = J H J by 
some form of (block-)diagonal matrix H. Such approxima¬ 
tion is equivalent to separating the optimization problem 
into subsets of parameters that are treated as independent. 
We will show that some of these approximations are simi¬ 
lar to or even fully equivalent to previously proposed algo¬ 
rithms, while others produce new algorithmic variations. 


3.1 Diagonal approximation and StefCal 

Let us first consider the DI case. The structure of J H J in 
Eq. 12.121 suggests that it is diagonally dominant (especially 
for larger A an t), as each diagonal element is a coherent sum 
of IVant amplitude-squared y-terms, while the off-diagonal 
elements are either zero or a product of two y terms. This 
is graphically illustrated in Fig. QJf). It is therefore not un¬ 
reasonable to approximate J H J with a diagonal matrix for 
purposes of inversion (or equivalently, making the assump¬ 
tion that the problem is separable per antenna): 


H = 


A 

0 


0 

A 


(3.1) 


This makes the costs of matrix inversion negligible - 0(N an t) 
operations, as compared to the 0(N ant ) cost of computing 
the diagonal elements of the Jacobian in the first place. The 


price of using an approximate inverse for J H J is a less accu¬ 
rate update step, so we can expect to require more iterations 
before convergence is reached. 

Combining this approximation with GN optimization 
and using Eq. 12.191 we find the following expression for the 
GN update step: 

g k+1 = Hvl J^d, (3.2) 

where Aul designates the top left quadrant of matrix A. 
Note that the condition of Eq. 12.201 is met: H \jHl = 
A~ 1 A = I, i.e. the GN update can be written in terms 
of d. This comes about due to (i) the off-diagonal blocks 
of H being null, which masks out the bottom half of JRl, 
and (ii) the on-diagonal blocks of H being an exact inverse. 
In other algorithms suggested below, the second condition 
particularly is not the case. 

The per-element expression, in the diagonal approxima¬ 
tion, is 

9p,k +1 — ( 'y ' VpqVpq) 'y ' Vpqdpq- (3-3) 

qjtp q^p 

Equation [T3Jis identical to the update step proposed 
bv iHamaked |2000), and later adopted by iMitchell et al.l 
(2008 ) for MWA calibration , and independently derived by 
ISalvini fe Wiinholdj (l2014aT) for the StefCal algorithm. 
Note that these authors arrive at the result from a differ¬ 
ent direction, by treating Eq. 12.21 as a function of g only, 
and completely ignoring the conjugate term. The resulting 
complex Jacobian ('Eq. 12.61) then has null off-diagonal blocks, 
and J H J becomes diagonal. 

Interestingly, applying the same idea to LM optimiza¬ 
tion (Eq. 11.911 , and remembering that H is diagonal, we can 
derive the following update equation instead: 

g k+1 = Y^g, + (3.4) 

which for A = 1 essentially becomes the basic 

averag e-up date step of St efCal. We should note that 
ISalvini fc Wiinholda ll2014ah empirically find better conver¬ 
gence when Eq. 13.21 is employed for odd k, and Eq. 13.41 for 
even k. In terms of the framework defined here, the basic 
StefCal algorithm can be succinctly described as com¬ 
plex optimization with a diagonally-approximated J H J, us¬ 
ing GN for the odd steps, and LM (\ = 1) for the even 
steps. 

Establishing this equivalence is very useful for our pur¬ 
poses, since the convergence p roperties of StefC al have 
been thoroughly explored bv ISalvini fe Wiinholdsl (l2014al l. 
and we can therefore hope to apply these lessons here. In 
particular, these authors have shown that a direct appli¬ 
cation of GN produces very slow (oscillating) convergence, 
whereas combining GN and LM leads to faster convergence. 
They also propose a number of variations of the algorithm, 
all of which are directly applicable to the above. 

Finally, let us note in passing that the update step of 
Eq. [3731 is embarrassingly parallel, in the sense that the up¬ 
date for each antenna is computed entirely independently. 


3.2 Separability of the direction-dependent case 

Now consider the problem of inverting J H J in the DD case. 
This is a massive matrix, and a brute force approach would 
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Figure 1. A graphical representation of J H J for a case of 40 antennas and 5 directions. Each pixel represents the amplitude of a single 
matrix element. The left column (a-d) shows conventional real-only Jacobians constructed by taking the partial derivatives w.r.t. the 
real and imaginary parts of the gains. The ordering of the parameters is (a) real/imaginary major, direction, antenna minor (i.e. antenna 
changes fastest); (b) real/imaginary, antenna, direction; (c) direction, real/imaginary, antenna; (d) antenna, real/imaginary, direction. 
The right column (e-h) shows full complex Jacobians with similar parameter ordering (direct/conjugate instead of real/imaginary). Note 
that panel (f) can also be taken to represent the direction-independent case, if we imagine each 5x5 block as one pixel. 
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scale as 0(N^ ir .N^ nt ). We can, however, adopt a few approx¬ 
imations. Note again that we are free to reorder our aug¬ 
mented parameter vector (which contains both g and g), as 
long as we reorder the rows and columns of J H J accord¬ 
ingly. 

Let us consider a number of orderings for g: 


• conjugate major, direction, antenna minor (CDA): 




( 2 ) ( 3 ) 

■9nL’9i- 


( iv dir ) (i) 
9 N an t ’ 9 1 


o (1) 

■'c\,, ri t 


• conjugate, antenna, direction (CAD): 

[„(!) 0 (JVdir) (1) „(iV dir ) (1) (N dir ) (1) 

[9i ■■ - 9 1 ) 5 2 ■■ - 9 2 1 93 ■ ■ ■ aAf ant > 9i 

• direction, conjugate, antenna (DCA): 

fo (1) o (1) o (1) o (1) o (2) o (2) o (2) 1 T 

• antenna, conjugate, direction (ACD): 


■ si^gl 1 '. 


■ g^rg?. 


(3.5) 

] T 

(3.6) 

(3.7) 

] T 

(3.8) 


Figure [T]( e-h) graphically illustrates the structure of the 
Jacobian under these orderings. 

At this point we may derive a whole family of DD cal¬ 
ibration algorithms - there are many ways to skin a cat. 
Each algorithm is defined by picking an ordering for g, then 
examining the corresponding J H J structure and specifying 
an approximate matrix inversion mechanism, then applying 
GN or LM optimization. Let us now work through a couple 
of examples. 


3.2.1 DC A: separating by direction 

Let us first consider the DC A ordering (Fig. [1^). The J H J 
term can be split into N d ; r x A r d j r blocks: 



' Jl ■ 

. . jf dir ' 

J H J = 




/A dir ■ 

1 

3 


where the structure of each 2AI an t x 2AT an t block at row c, 
column d, is exactly as given by Eq. 12.301 (or by Fig. [Tf, in 
miniature). 

The on-diagonal (“same-direction”) blocks will have 
the same structure as in the DI case (Eq. 12.121 or Eq. IA3I) . 
Consider now the off-diagonal (“cross-direction”) blocks . 
Their non-zero elements can take one of two forms: 


E 

ViqsVitl = E 9 ( q C) 9 ( q d) E ^qs^qs 

(3.10) 

qjil,s 

qjii s 



(c) (d) _(c)_(d)V^ - (c) (d) 

= 9i 9; 

(3.11) 


S S 


A common element of both is essentially a dot product 
of sky model components. This is a measure of how “non- 
orthogonal” the components are: 

x pq d) = ^ m$ s fh$ a . (3.12) 

S 

We should now note that each model component will typi¬ 
cally correspond to a source of limited extent. This can be 


expressed as 

(d) _ o(d) ,(d) , o\ 

' n pqtV ~ ^pqtv^pqtvl \O.LO) 

where the term S represents the visibility of that sky model 
component if placed at phase centre (usually only weakly 
dependent on t, v - in the case of a point source, for example, 
S is just a constant flux term), while the term 

ad = [i dt md: n d _ if, (3T4) 

represents the phase rotation to direction cr d (where Imn are 
the corresponding direction cosines), given a baseline vector 
as a function of time u pq (t). We can then approximate the 
sky model dot product above as 

v C cd ) _ q(c) Q (d) V - ' - 2 TTi[u pg (t).(cr c -cr d )]is/c /o i t\ 

s*-pq ^pq °pq / j c 


The sum over samples s is essentially just an integral 
over a complex fringe. We may expect this to be small (i.e. 
the sky model components to be more orthogonal) if the 
directions are well-separated, and also if the sum is taken 
over longer time and frequency intervals. 

If we now assume that the sky model components are 
orthogonal or near-orthogonal, then we may treat the “cross¬ 
direction” blocks of the J H J matrix in Eq. I3.9l as null. The 
problem is then separable by direction, and J H J is approx¬ 
imated by a block-diagonal matrix: 


-Jt 0 ‘ 

Q jNdir 

L U J N dir J 


(3.16) 


The inversion complexity then reduces to 0(Ndi T N^ nt ), 
which, for large numbers of directions, is a huge improve¬ 
ment on 0(Ai| ir A(f nt ). Either GN and LM optimization may 
now be applied. 


3.2.2 COHJONES: separating by antenna 

A complementary approach is to separate the problem by 
antenna instead. Consider the CAD ordering (Fig. [!}'). The 
top half of J H J then has the following block structure (and 
the bottom half is its symmetric conjugate): 




0 B\ ... Bf ant ' 


Hu = 


(3.17) 


A 


N ant 
N ant 


B 


i 

N ant 


B 


N ant 
N ant “ 


that is, its left half is block-diagonal, consisting of Ndii x Ndii 
blocks (which follows from Eq. 12. 3011 . while its right half 
consists of elements of the form given by Eq. 13.111 

By analogy with the StefCal approach, we may as¬ 
sume Bj « 0, i.e. treat the problem as separable by an¬ 
tenna. The H matrix then becomes block-diagonal, and we 
only need to compute the true matrix inverse of each A], 
The inversion problem then reduces to 0(A r dir A r a nt) in com¬ 
plexity, and either LM or GN optimization may be applied. 

For GN, the update step may be computed in direct 
analogy to Eq. 13.31 f noting that Eq. 12.201 holds'): 


9k+i — H vh J h H d. 


(3.18) 
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We may note in passing that for LM, the analogy is only 
approximate: 

Sk+i ~ i + ^ 9k + YjfxHvhJf 1 d, (3.19) 

since H is only approximately diagonal. 

This approach has been implemented as the CohJones 
(complex half-Jacobian optimization for n-directional esti¬ 
mation) algorithnJ3, the results of which applied to simu¬ 
lated data are presented below. 


3.2.3 ALL JONES: Separating all 

Perhaps the biggest simplification available is to start with 
CDA or CAD ordering, and assume a STEFCAL-style di¬ 
agonal approximation for the entirety of J H J. The matrix 
then becomes purely diagonal, matrix inversion reduces to 
O(Adir^Vant) 411 complexity, and algorithmic cost becomes 
dominated by the 0(A r d i r A r f nt ) process of computing the di¬ 
agonal elements of J H J. Note that Eq. l2.20l no longer holds, 
and the GN update step must be computed via the residuals: 

Sk+i = r, (3.20) 

with the per-element expression being 

4jfe+i = 9^1 + ( E E ( 3 - 21 ) 

v qAp,s ' q^p,s 


3.3 Convergence and algorithmic cost 

Evaluating the gain update (e.g. as given by Eq. 12.1811 in¬ 
volves a number of computational steps: 

(i) Computing H ~ J H J , 

(ii) Inverting H, 

(iii) Computing the J H d vector, 

(iv) Multiplying the result of (ii) and (iii). 

Since each of the algorithms discussed above uses a dif¬ 
ferent sparse approximation for J H J, each of these steps 
will scale differently (except iii, which is 0(N% nt N dd ) for 
all algorithms). Table [2] summarizes the scaling behaviour. 
An additional scaling factor is given by the number of it¬ 
erations required to converge. This is har der to quantify. 
For example, in our experience l|Smirnovll2013i l. an “exact” 
(LM) implementation of DI calibration problem converges 
in much fewer iterations than StefCal (on the order of a 
few vs. a few tens), but is much slower in terms of “wall 
time” due to the more expensive iterations (N^ nt vs. N an t 
scaling). This trade-off between “cheap-approximate” and 
“expensive-accurate” is typical for iterative algorithms. 

CohJones accounts for interactions between directions, 
but ignores interactions between antennas. Early experi¬ 
ence indicates that it converges in a few tens of iterations. 
All Jones uses the most approximative step of all, ignor¬ 
ing all interactions between parameters. Its convergence be¬ 
haviour is untested at this time. 

It is clear that depending on A r an t and Ndii, and also 
on the structure of the problem, there will be regimes where 

4 In fact it was the initial development of CohJones by iTassel 
ll2014tl that directly led to the present work. 


Table 2. The scaling of computational costs for a single iteration 
of the four DD calibration algorithms discussed in the text, broken 
down by computational step. The dominant term(s) in each case 
are marked by “f”. Not shown is the cost of computing the J H d 
vector, which is 0(IV 2 nt .V,]ir) for all algorithms. “Exact” refers 
to a naive implementation of GN or LM with exact inversion of 
the J H J term. Scaling laws for DI calibration algorithms may 
be obtained by assuming 7V d j r = 1, in which case CohJones or 
AllJones become equivalent to StefCal. 


algorithm 

H 

H 1 

multiply 

Exact 


OWLtN&l 

o(K nt N^) 

AllJones 

0(N% nt N diT ) t 

0(A ant A dir ) 

0(A ant A dir ) 

CohJones 

0(A 2 nt A 2 ir ) t 

0(N ant lV| ir ) t 

0(lV an tA d 2 . r ) 

DC A 

0(N^ t N dil ) 

o(Ni nt N diI y 

O(^nt^dir) 


one or the other algorithm has a computational advantage. 
This should be investigated in a future work. 


3.4 Smoothing in time and frequency 


From physical considerations, we know that gains do not 
vary arbitrarily in frequency and time. It can therefore be 
desirable to impose some sort of smoothness constraint on 
the solutions, which can improve conditioning, especially in 
low-SNR situations. A simple but crude way to do this is 
use solution intervals lSect. [T3l) . which gives a constant gain 
solution per interval, but produces non-physical jumps at the 
edge of each interval. Other approaches include a posteriori 
smoothing of solutions done on smaller intervals, as well as 
various filter-based algorithms llTassell20l4l . 

Another way to impose smoothness combines the ideas 
of solution intervals (Eq. 12.2111 and weighting (Eq. 12.241) . 
At every time/frequency sample to,vo, we can postulate a 
weighted LS problem: 

min V"' w{t — to, v - v 0 )\r pq tv\ 2 , (3.22) 

9 * 
pqtu 


where w is a smooth weighting kernel that upweighs samples 
at or near the current sample, and downweighs distant sam¬ 
ples (e.g., a 2D Gaussian). The solutions for adjacent sam¬ 
ples will be very close (since they are constrained by prac¬ 
tically the same range of data points, with only a smooth 
change in weights), and the degree of smoothness can be 
controlled by tuning the width of the kernel. 

On the face of it this approach is very expensive, since 
it entails an independent LS solution centred at every to, vq 
sample. The diagonal approximation above, however, allows 
for a particularly elegant and efficient way of implementing 
this in practice. Consider the weighted equations of Eas. l2.251 
and 12.271 and replace the sample index s by t, u. Under the 
diagonal approximation, each parameter update at to, vo is 
computed as: 


9 P ,k+i(to, vo) 


w(t to,V Vo )ypqth'dpqtu 

qAp,t,v _ 

w(t to, v Vo)ypqtvypqtv 


(3.23) 


qAp, t , v 

Looking at Eq. 13.231 it’s clear that both sums represent a 
convolution. If we define two functions of t, w. 


Oip{t,v) — ^ ' ypqtvdpqtv, j3p{t,v) — 'y ' ypqtvypqtv, (3.24) 
qAv qAp 
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then Eq. 13.231 corresponds to the ratio of two convolutions 
Sp,fc +1 (t,«/) = 7^, (3.25) 

XV O Pp 

sampled over a discrete t, v grid. Note that the formula¬ 
tion above also allows for different smoothing kernels per 
antenna. Iterating Eq. l3.25l to convergence at every t, v slot, 
we obtain per-antenna arrays of gain solutions answering 
Eq. 13.221 These solutions are smooth in frequency and time, 
with the degree of smoothness constrained by the kernel w. 

There is a very efficient way of implementing this in 
practice. Let’s assume that d pq and y vq are loaded into mem¬ 
ory and computed for a large chunk of t, v values simulta¬ 
neously (any practical implementation will probably need 
to do this anyway, if only to take advantage of vectorized 
math operations on modern CPUs and GPUs). The param¬ 
eter update step is then also evaluated for a large chunk of 
t, v, as are the a p and f5 p terms. We can then take advantage 
of highly optimized implementations of convolution (e.g. via 
FFTs) that are available on most computing architectures. 

Smoothing may also be trivially incorporated into the 
All Jones algorithm, since its update step fEq. l3.2fll has ex¬ 
actly the same structure. A different smoothing kernel may 
be employed per direction (for example, directions further 
from the phase centre can employ a narrower kernel). 

Since smoothing involves computing a g value at ev¬ 
ery t, u point, rather than one value per solution interval, 
its computational costs are correspondingly higher. To put 
it another way, using solution intervals of size Nt x N„ in¬ 
troduces a savings of Nt x 7V„ (in terms of the number of 
invert and multiply steps required, see Table [2]) over solving 
the problem at every t, v slot; using smoothing foregoes these 
savings. In real implementations, this extra cost is mitigated 
by the fact that the computation given by Eq. 13.211 may be 
vectorized very efficiently over many t, v slots. However, this 
vectorization is only straightforward because the matrix in¬ 
version in StefCal or All Jones reduces to simple scalar 
division. For CohJones or DCA this is no longer the case, so 
while smoothing may be incorporated into these algorithms 
in principle, it is not clear if this can be done efficiently in 
practice. 


4 THE FULLY POLARIZED CASE 

To incorporate polarization, let us start by rewriting the 
basic RIME of Eo. I2TT1 usin g 2x 2 matrices (a full derivation 
may be found in lSmirnovll2011ah : 

Dpq — GpAI p qG q Npq- (4-1) 

Here, D pq is the visibility matrix observed by baseline pq, 
M pq is the sky coherency matrix, G p is the Jones matrix 
associated with antenna p, and N pq is a noise matrix. Quite 
importantly, the visibility and coherency matrices are Her- 
mitian: D pq = D^ p , and M pq = M^ p . The basic polariza¬ 
tion calibration problem can be formulated as 

min ||Jip 9 ||_F, Rpq = D pq — G p M pq G q . (4.2) 

{Cxp/ 

pq 

This is a set of 2 x 2 matrix equations, rather than the 
vector equations employed in the complex NNLS formalism 
above fEo. 11.411 . In principle, there is a straightforward way 
of recasting matrix equations into a form suitable to Eq. ll. 41 


we can vectorize each matrix equation, turning it into an 
equation on 4-vectors, and then derive the complex Jacobian 
in the usual manner (Eq. 11.711 . 

In this section we will obtain a more elegant derivation, 
by employing an operator calculus where the “atomic” ele¬ 
ments are 2x2 matrices rather than scalars. This will allow 
us to define the Jacobian in a more transparent way, as a 
matrix of linear operators on 2 x 2 matrices. Mathematically, 
this is completely equivalent to vectorizing the problem and 
applying Wirtinger calculus (each matrix then corresponds 
to 4 elements of the parameter vector, and each operator 
in the Jacobian becomes a 4 x 4 matrix block). The ca¬ 
sual reader may simply take the postulates of the following 
section on faith in particular, that the operator calculus 
approach is completely equivalent to using 4x4 matrices. 
A rigorous formal footing to this is given in Appendix [B] 


4.1 Matrix operators and derivatives 


By matrix operator, we shall refer to any function T that 
maps a 2 x 2 complex matrix to another such matrix: 

JF : C 2x2 —¥ C 2x2 . (4.3) 


When the operator T is applied to matrix X, we’ll write the 
result as Y = FX, or F[X] if we need to avoid ambiguity. 

If we fix a complex matrix A, then two interesting (and 
linear) matrix operators are right-multiply by A, and left- 
mult iply by A: 


UaX = XA 
CaX =AX 


(4.4) 


Appendix[B]formally shows that all linear matrix operators, 
including 7 Za and Ca, can be represented as multiplication 
of 4-vectors by 4 x 4 matrices. 

Just from the operator definitions, it is trivial to see 

that 

CaCb =Cab, XaNb = Xba, [72.a] 1 = 72 a - i (4.5) 

Consider now a matrix-valued function of n matrices 
and their Hermitian transposes 

F(Gi... G„, Gq ... Gp), (4.6) 


and think what a consistent definition for the partial matrix 
derivative dF/dGt would need be. A partial derivative at 
some fixed point Go = (Gi . .. G n , Gq .. . G%) is a local 
linear approximation to F, i.e. a linear function mapping 
an increment in an argument A Gi to an increment in the 
function value A F. In other words, the partial derivative is 
a linear matrix operator. Designating this operator as V = 
dF/dGi, we can write the approximation as: 

E(..„ Gi + A Gi, ...) - F{..., Gi ,...) « VAGi. (4.7) 


Obviously, the linear operator that best approximates a 
given linear operator is the operator itself, so we necessarily 
have 


8(GA) 

dG 


= 7 Za, 


d(AG H ) _ 

dG 11 


(4.8) 


Appendix[B]puts this on a formal footing, by providing 
formal definitions of Wirtinger matrix derivatives 


d F dF 

dG’ dG H 


(4.9) 
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that are completely equivalent to the partial complex Jaco- 
bians defined earlier. 

Note that this calculus also offers a natural way of tak¬ 
ing more complicated matrix derivatives (that is, for more 
elaborate versions of the RIME). For example, 

d(AGB) 


dG 


= CaTZb, 


(4.10) 


which is a straightforward manifestation of the chain rule: 

AGB = A(GB). 


4.2 Complex Jacobians for the polarized RIME 

Let us now apply this operator calculus to Eq. 14.21 Taking 
the derivatives, we have: 


dV pi 

dG v 


- Km pq gh, 


a dV ™ r 
“ d 8G* ~ Ca ~ 


(4.11) 


If we now stack all the gain matrices into one augmented 
“vector of matrices”: 


G = [Gi 


-r H lT 
r JVantl 1 


(4.12) 


then we may construct the top half of the full complex Jaco¬ 
bian operator in full analogy with the derivation of Eq. 12.61 
We’ll use the same “compound index” convention for pq. 
That is, [pq] will represent a single index running through 
M values (i.e. enumerating all combinations of p < q). 


0 — 1 • • ■ N ant 


3 — 1 ■ ■ - N ant 


J U = [ T^M pq GHSp £G p M pq fiq ] }[p<j] = l...iV bl (p<q) 

(4.13) 

Note that there are two fully equivalent ways to read 
the above equation. In operator notation, it specifies a linear 
operator J u that maps a 2TVant-vector of 2 x 2 matrices to an 
IVbi-vector of 2 x 2 matrices. In conventional matrix notation 
(Appendix [B), Ju is just a 47Vbi x 8Aant matrix; the above 
equation then specifies the structure of this matrix in terms 
of 4 x 4 blocks, where each block is the matrix equivalent of 
the appropriate 77 or C operator. 

Consider now the bottom half of the Jacobian. In 
Eq. 11.71 this corresponds to the derivatives of the conjugate 
residual vector fk, and can be constructed by conjugating 
and mirroring J u- Let us modify this construction by tak¬ 
ing the derivative of the Hermitian transpose of the residuals 
instead. Note that substituting the Hermitian transpose for 
element-by-element conjugation corresponds to a simple re¬ 
ordering of some rows in the conjugate residual vector (i.e. 
reordering of the LS equations), which we are always free to 
do. Let us then construct the augmented residual vector of 
matrices as: 


R = 


R p 

R„ 


}[pg]=l...JV b i (p<q) 
}[p 9 ] = l...iVbl (p<q) 


Now, since Vf„ = G q Mf„Gw , we have 


dV* _ 

sg 7 - n M? q G?’ 


dV H 

and -rr = C G M h . 

dG H G a M„„. 


(4.14) 


(4.15) 


T q v -*p 

and we may write out the full complex Jacobian as 

}[pq] = l...iVbi ( p<q ) 

}[pq]=i...Af b i (p<q) 

(4.16) 


J = - 


R-M pq GH SJ p C Gp M pq 5 J q 
£-G a MH Sp 


Rm h „G h fiq 


We may now make exactly the same observation as we 
did to derive Eq. m and rewrite both J and R in terms of 
a single row block. The pq index will now run through 27Vbi 
values (i.e. enumerating all combinations of p ^ q)\ 

J = \lZ Mpq GH5 3 p C Gp M pq S }[pq]=1...2N hl (p^q) (4.17) 
and 

R = [-Rpq 1 ] }[p?]=1...2iV bI (p^q) (4.18) 


This is in complete analogy to the derivations of the unpo¬ 
larized case. For compactness, let us now define 

Y pq = M pq G q , Y q p = M pq Gf (4.19) 

noting that 

Y” = G q Mf q , = GpMpq. (4.20) 
Employing Eq. IB12I the J and J H terms can be written as 


J H = 


77.',- u fip 


CYqJl 


J = [Ry p J j p C y h5 j] 


(4.21) 


We can now write out the J H J term, still expressed in 
terms of operators, as: 


J H J = 


diag J2 R Yiq Yt 

q^i ' q 


77 


Y " Y fi 

0, 


*74 


^Y i:j R-Yji , ijLj 
0, i=j 


diag E C Y ia Y.H 

q^i g 


(4.22) 


(Note that this makes use of the property 77 a 77 b = 77 ba 
and CaC-b = Cab-) Compare this result to Eq. 12.121 

As for the J H R term, we can directly apply the linear 
operators appearing in J H to the matrices in R. This results 
in the following vector of 2 x 2 matrices: 


J H R = 


E Yf q Rp q S£ 

pq . 

E RpqYqpSq 

.pq 



'E 

q^ti 




(4.23) 


where the second equality is established by swapping the p 
and q indices. Unsurprisingly, and by analogy with Eo. 12.141 
the bottom half of the vector is Hermitian with respect to 
the top. 


4.3 Parameter updates and the diagonal 
approximation 

The relation of Eq. 12.161 also apply in the fully-polarized 
case. It is easy to see that if we define the augmented data 
and model vectors of matrices as 

D = [G m ], Y=[GpMp q G^l (4.24) 

then V = JlG holds, and the GN update step can be writ¬ 
ten as 

Gfc+i = {J H jr^D. (4.25) 

By analogy with Eq. 12.181 this equation also holds when 
J H J is approximated, but only if the condition of Eo. 12.201 
is met. 

To actually implement GN or LM optimization, we still 
need to invert the operator represented by the J H J matrix 
in Eq. 14.221 We have two options here. 
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The brute-force numerical approach is to substitute the 
4x4 matrix forms of the 1Z and £ operators (Eqs. IB 101 
and 1B1 11) into the equation, thus resulting in conventional 
matrix, and then do a straightforward matrix inversion. 

The second option is to use an analogue of the diago¬ 
nal approximation described in Sect. 13.11 If we neglect the 
off-diagonal operators of Eq. 14.221 the operator form of the 
matrix is diagonal, i.e. the problem is again treated as being 
separable per antenna. As for the operators on the diagonal, 
they are trivially invertible as per Eq. 14.51 We can there¬ 
fore directly invert the operator form of J H J and apply it 
to J H R, thus arriving at a simple per-antenna equation for 
the GN update step: 




E 


Y H D 

1 pq^Pq 


E 


V V 

1 pq L pq 


(4.26) 


This is, once again, equivalent t o the po la rized Ste- 
fCal up date step proposed by ISmirnovI (l2013h and 
ISalvini fc Wiinholdsl (2014aj)- 


4.4 Polarized direction-dependent calibration 


Let us now briefly address the fully-polarized DD case. This 
can be done by direct analogy with Sect. 12.51 using the op¬ 
erator calculus developed above. As a result, we arrive at 
the following expression for J H J\ 


'R- Y (.A)yr[c)H 
q^i,s t<Z s 


~^TZ Y (a)H C Y (d)H 


J2 £y(c) (?) , 




J2 £y(c)y(d)H 
q^i,s ic l s i( l s 


(4.27) 


using the normal shorthand of Yp d q S = M^?q S Gq d ^ H . The 
J fl R term is then 


J H R = 


E \r(d)H p 

1 iqs f'-iqs 

q^i,s 


(4.28) 


All the separability considerations of Sect. [3] now apply, and 
polarized versions of the algorithms referenced therein may 
be reformulated for the fully polarized case. For example: 


• If we assume separability by both direction and an¬ 
tenna, as in the All Jones algorithm, then the H matrix 
is fully diagonal in operator form, and the GN update step 
can be computed as 


SG 


(rf) 

p,k -\-1 


V Y {d)H R 

/ j 1 pqs i *'2 

q^p,s 


E 


’y(d) -y(d)H 
± pqs - 1 pqs 


(4.29) 

Note that in this case (as in the unpolarized AllJones ver¬ 
sion) the condition of Eq. 12.201 is not met, so we must use 
the residuals and compute 5G. 

• If we only assume separability by antenna, as in the 
CohJones algorithm, then the H matrix becomes 4A’ c ]j r x 
4A’ c ji r -block-diagonal, and may be inverted exactly at a cost 
of 0(A r J ir A r an t). The condition of Eq. 12.201 is met. 


Equations 14. 27ff4.281 can be considered the principal re¬ 
sult of this work. They provide the necessary ingredients 
for implementing GN or LM methods for DD calibration, 
treating it as a fully complex optimization problem. The 
equations may be combined and approximated in different 
ways to produce different types of calibration algorithms. 

Another interesting note is that Eo. 14.291 and its ilk are 
embarrassingly parallel, since the update step is completely 
separated by direction and antenna. This makes it partic¬ 
ularly well-suited to implementation on massively parallel 
architectures such as GPUs. 


5 OTHER DD ALGORITHMIC VARIATIONS 

The mathematical framework developed above (in particu¬ 
lar, Eos. 14.27114.2811 provides a general description of the po¬ 
larized DD calibration problem. Practical implementations 
of this hinge around inversion of a very large J H J matrix. 
The family of algorithms proposed in Sect. [3] takes different 
approaches to approximating this inversion. Their conver¬ 
gence properties are not yet well-understood; however we 
may note that the StefCal algorithm naturally emerges 
from this formulation as a specific case, and its conve rgence 
has been established by lSalvini fe Wiinholdsl (l2014a[f . This 
is encouraging, but ought not be treated as anything more 
than a strong pointer for the DD case. It is therefore well 
worth exploring other approximations to the problem. In 
this section we map out a few such options. 


5.1 Feed forward 

ISalvini fe Wiinholdsl (|2014bl ~) propose variants of the Ste¬ 
fCal algorithm (“2-basic” and “2-relax”) where the results 
of the update step (Eq. 14.261 in essence) are computed se¬ 
quentially per antenna, with updated values for Gi . . . Gk -1 
fed forward into the equations for Gk (via the appropriate 
Y terms). This is shown to substantially improve conver¬ 
gence, at the cost of sacrificing the embarrassing parallelism 
by antenna. This technique is directly applicable to both the 
AllJones and Coh Jones algorithms. 

The Coh Jones algorithm considers all directions si¬ 
multaneously, but could still implement feed-forward by an¬ 
tenna. The AllJones algorithm ('Eq. 14.291) could implement 
feed-forward by both antenna (via V) and by direction - by 
recomputing the residuals R to take into account the up¬ 
dated solutions for G (1 ' ... G ( ' d ^ 1 ' > before evaluating the so¬ 
lution for G^. The optimal order for this, as well as whether 
in practice this actually improves convergence to justify the 
extra complexity, is an open issue that remains to be inves¬ 
tigated. 


5.2 Triangular approximation 


It is also straightforward to add weights and/or sliding 
window averaging to this formulation, as per Sect. 12.41 and 

I3T1 


The main idea of feed-forward is to take into account so¬ 
lutions for antennas (and/or directions) 1 ,..., k — 1 when 
computing the solution for k. A related approach is to ap- 
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proximate the J H J matrix as block-triangular: 


H = 


Ji 


o 

J 2 2 


-Jn 

Jl 

stN 

Jn J 

of this is 

also 

block 

triangular: 


\K\ 

0 

0 ' 

H X = 

K.\ 

/c| 

• • O 


-h3 

_1 

/C 2 v 

^ . 

1_ 


(5.1) 


(5.2) 


which can be computed using Gaussian elimination: 
K\ = [Jr 1 ]" 1 

ici = [Jir 1 

fr'-l 1^2 ST 1 I'-’l 

/v'2 — — 

ICl = [Js 3 ]" 1 

A-3 — ~^3d3 A 2 

fc-3 = ~ ^c|[J3 1 /Ci + J 3 K. 2 ] 


(5.3) 


From this, the GN or LM update steps may be derived di¬ 
rectly. 


5.3 Peeling 

The p eeling procedure was originally suggested bv lNoordaml 
|200J) as a “kludge”, i.e. an implementation of DD calibra¬ 
tion using the DI functionality of existing packages. In a 
nutshell, this procedure solves for DD gains towards one 
source at a time, from brighter to fainter, by 

(i) Rephasing the visibilities to place the source at phase 
centre; 

(ii) Averaging over some time/frequency interval (to sup¬ 
press the contribution of other sources); 

(iii) Doing a standard solution for DI gains (which ap¬ 
proximates the DD gains towards the source); 

(iv) Subtracting the source from the visibilities using the 
obtained solutions; 

(v) Repeating the procedure for the next source. 

The term “peeling” comes from step (iv), since sources 
are “peeled” away one at a tim^. 

Within the framework above, peeling can be considered 
as the ultimate feed forward approach. Peeling is essentially 
feed-forward by direction, except rather than taking one step 
over each direction in turn, each direction is iterated to full 
convergence before moving on to the next direction. The 
procedure can then be repeated beginning with the bright¬ 
est source again, since a second cycle tends to improve the 
solutions. 


Amplitude 



Phase 



Figure 3. Amplitude (left panel) and phase (right panel) of the 
block-diagonal matrix (J H J )ul f° r the dataset described in the 
text. Each block corresponds to one antenna; the pixels within a 
block correspond to directions. 


a: 
fa 

E 



-0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 

I [radian] 


Figure 5. In order to conduct direction-dependent calibration, 
sources are clustered using a Voronoi tessellation algorithm. Each 
cluster has its own DD gain solution. 


5.4 Exact matrix inversion 

Better approximations to (J H J) -1 (or a faster exact in¬ 
verse) may exist. Consider, for example, Fig. [Tf: the matrix 
consists of four blocks, with the diagonal blocks being triv¬ 
ially invertible, and the off-diagonal blocks having a very 
specific structure. All the approaches discussed in this pa¬ 
per approximate the off-diagonal blocks by zero, and thus 
yield algorithms which converge to the solution via many 
cheap approximative steps. If a fast way to invert matrices 
of the off-diagonal type (faster than 0(N 3 ), that is) could be 
found, this could yield calibration algorithms that converge 
in fewer more accurate iterations. 


5 The term “peeling” has occasionally been misappropriated to 
describe other schemes, e.g. simultaneous independent DD gain 
so lutions . We cons ider this a misuse: both the original formulation 
bv lNoordaml (120041 ), and the word “peeling” itself, strongly implies 
dealing with one direction at a time. 


6 IMPLEMENTATIONS 


6.1 StefCal in MeqTrees 


Some of the id eas above have already been implemented in 
the MeqTrees (Noordam & Smirnov l2010h version of Ste- 
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Figure 2. Amplitude (top row) and phase (bottom row) of the difference between the estimated and true gains, as a function of iteration. 
Columns correspond to directions. Different lines correspond to different antennas. 



Figure 4. Simulation with time-variable DD gains. We show a deconvolved image (left) where no DD solutions have been applied, a 
residual image (centre) made by subtracting the sky model (in the visibility plane) without any DD corrections, and a residual image 
(right) made by subtracting the sky model with CoHjoNES-estimated DD gain solutions (right). The color scale is the same in all panels. 
In this simulation, applying Coh Jones for DD calibration reduces the residual rms level by a factor of ~ 4. 


fCal (|Smirnov! 120131 ). In particular, the MeqTrees version 
uses peeling (Sect. 15.3() to deal with DD solutions, and im¬ 
plements fully polarized StefCal with support for both so¬ 
lution intervals and time/frequency smoothing with a Gaus¬ 
sian kernel (as per Sect. i3~a . This has already been applied 
to JVLA L-band data to obtain what is (at time of writing) 
a world record dynami c rang e (3.2 million) image of the field 
around 3C147 dPerlevll20ia) . 

6.2 CohJones tests with simulated data 

The CohJones algorithm, in the unpolarized version, has 
been implemented as a standalone Python script that uses 
the pyrajjf] and casacor^H libraries to interface to Measure- 

6 https: //code. google. com/p/pyrap 
1 https://code.google.com/p/casacore 


ment Sets. This section reports on tests of our implementa¬ 
tion with simulated Low Frequency Array (LOFAR) data. 

For the tests, we build a dataset using a LOFAR layout 
with 40 antennas. The phase center is located at 5 = +52°, 
the observing frequency is set to 50 MHz (single channel), 
and the integrations are 10s. We simulate 20 minutes of data. 

For the first test, we use constant direction-dependent 
gains. We then run CohJones with a single solution inter¬ 
val corresponding to the entire 20 minutes. This scenario is 
essentially just a test of convergence. For the second test, 
we simulate a physically realistic time-variable ionosphere 
to derive the simulated DD gains. 


6.2.1 Constant DD gains 

To generate the visibilities for this test, we use a sky model 
containing five sources in an “+” shape, separated by 1°. The 
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Direction 0 



Time [min] 


Figure 6. The complex phases of the DD gain terms (for all 
antennas and a single direction) derived from the time-variable 
TEC screen used in Sect. 16.2.21 


gains for each antenna p, direction d are constant in time, 
and are taken at random along a normal distribution (ff ~ 
JV ( 0,1) + iAf { 0, 1). The data vector d is then built from 
all baselines, and the full 20 minutes of data. The solution 
interval is set to the full 20 minutes, so a single solution per 
direction, per antenna is obtained. 

The corresponding matrix (J H J )ul is shown in Fig. [31 
It is block diagonal, each block having size JVdir x Mlir- The 
convergence of gain solutions as a function of direction is 
shown in Fig. [5] It is important to note that the prob¬ 
lem becomes better conditioned (and CohJones converges 
faster) as the blocks of (J H J )ul become more diagonally- 
dominated (equivalently, as the sky model components be¬ 
come more orthogonal). As discussed in Sect. 13.2711 this hap¬ 
pens (i) when more visibilities are taken into account (larger 
solution intervals) or (ii) if the directions are further away 
from each other. 


6.2.2 Time-variable DD gains 

To simulate a more realistic dataset, we use a sky model 
composed of 100 point sources of random (uniformly dis¬ 
tributed) flux density. We also add noise to the visibilities, 
at a level of about 1% of the total flux. We simulate (scalar, 
phase-only) DD gains, using an ionospheric model consisting 
of a simple phase screen (an infinitesimally thin layer at a 
height of 100 km). The total electron content (TEC) values 
at the set of sample points are generated using Karhunen- 
Loeve decomposition (t he spati a l correlati on is given by Kol¬ 
mogorov turbulence, see Ivan der Tolll2009i) . The constructed 
TEC-screen has an amplitude of ~ 0.07 TEC-Unit, and the 
corresponding DD phase terms are plotted in Fig. [6] 

For calibration purposes, the sources are clustered in 10 
directions using Voronoi tessellation (Fig. [5]). The solution 
time-interval is set to 4 minutes, and a separate gain solution 
is obtained per each direction. Fig.[]]sliows images generated 
from the residual visibilities, where the best-fitting model is 
subtracted in the visibility domain. The rms residuals after 
CohJones has been applied are a factor of ~ 4 lower than 
without DD solutions. 


CONCLUSIONS 

Recent developments in optimization theory have extended 
traditional NLLS optimization approaches to functions of 
complex variables. We have applied this to radio interfero¬ 
metric gain calibration, and shown that the use of complex 
Jacobians allow for new insights into the problem, leading 
to the formulation of a whole new family of DI and DD cali¬ 
bration algorithms. These algorithms hinge around different 
sparse approximations of the J H J matrix; we show that 
some recent algorithmic developments, notably StefCal, 
naturally fit into this framework as a particular special case 
of sparse (specifically, diagonal) approximation. 

The proposed algorithms have different scaling proper¬ 
ties depending on the selected matrix approximation - in 
all cases better than the cubic scaling of brute-force GN or 
LM methods - and may therefore exhibit different compu¬ 
tational advantages depending on the dimensionality of the 
problem (number of antennas, number of directions). We 
also demonstrate an implementation of one particular algo¬ 
rithm for DD gain calibration, CohJones. 

The use of complex Jacobians results in relatively com¬ 
pact and simple equations, and the resulting algorithms tend 
to be embarrassingly parallel, which makes them particu¬ 
larly amenable to implementation on new massively-parallel 
computing architectures such as GPUs. 

Complex optimization is applicable to a broader range 
of problems. Solving for a large number of independent DD 
gain parameters is not always desirable, as it potentially 
makes the problem under-constrained, and can lead to arte¬ 
facts such as ghosts and source suppression. The alternative 
is solving for DD effect models that employ [a smaller set 
of] physical parameters, such as parameters of the primary 
beam and ionosphere. If these parameters are complex, then 
the complex Jacobian approach applies. Finally, although 
this paper only treats the NLLS problem (thus implicitly 
assuming Gaussian statistics), the approach is valid for the 
general optimization problem as well. 

Other approximations or fast ways of inverting the com¬ 
plex J H J matrix may exist, and future work can potentially 
yield new and faster algorithms within the same unifying 
mathematical framework. This flexibility is particularly im¬ 
portant for addressing the computational needs of the new 
generation of the so-called “SKA pathfinder” telescopes, as 
well as the SKA itself. 
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APPENDIX A: J AND J H J FOR THE 
THREE-ANTENNA CASE 

To give a specific example of complex Jacobians, consider the 
3 antenna case. Using the numbering convention for pq of 12, 
13, 32, we obtain the following partial Jacobians (Eqs. 12.41 
and 12.511 : 



m 12 g2 

0 

O' 


'0 

311012 

0 

J k — 

011333 

0 

0 

i Jk* — 

0 

0 

310113 


0 

102333 

0 


0 

0 

320123 


(Al) 


We then get the following expression for the full com¬ 
plex Jacobian J fEo. 12.81) : 


011232 

0 

0 

0 

310112 

0 

011333 

0 

0 

0 

0 

310113 

0 

Ol2l3l 

0 

320121 

0 

0 

0 

012333 

0 

0 

0 

320123 

0 

0 

Ol3l3l 

310131 

0 

0 

0 

0 

013232 

0 

33 0132 

0 


Then, with the usual shorthand of y vq = m pq g q , the 
J H J term becomes: 


^12+^13 0 0 0 V 12 V 21 V 13 V 31 

0 y \2 + V 23 0 2/12!/21 0 V 23 V 32 

0 0 yi 3 +V 23 V 13 V 31 V 23 V 32 0 

0 yi 2 V 21 1/131/31 V 12 +V 13 0 0 

1/121/21 o V 23 V 32 0 1/12+1/23 0 

V 13 V 31 1/231/32 0 0 0 Vl 3 +y 23 

Finally, the 3-antenna J H r term becomes 


(A3) 


J 


y 12*12 + 313*13 

321*21 + 323*23 
331*31 + 332*32 
V12T12 + Vl3ri3 
J /21 ^21 + 1/23723 
.1/31731 + 1/32732 


(A4) 


APPENDIX B: OPERATOR CALCULUS 


First, let us introduce the vectorization operator “vec” and 
its inverse in the usual (stacked columns) wajjf). For a 2 x 2 
matrix X: 


*11 


*n 

*21 

-1 

*21 


, vec 


*12 


*12 

_*22. 


_*22_ 


(Bl) 


which sets up an isomorphism between the space of 2 x 2 
complex matrices C 2x2 and the space C 4 . Note that the “vec” 
operator is linear, in other words the isomorphism preserves 
linear structure: 


vec (X + aY) = vec X + a vec X, (B2) 

as well as the Frobenius norm: 


||vecX|| F = ||X|| F . (B3) 

Consider now the set of all linear operators on C 2x2 , or 
Lin(C 2x2 ,C 2x2 ). Any such linear operator B , whose action 
we’ll write as BX, can be associated with a linear operator 
on 4-vectors B £ Lin(C 4 ,C 4 ), by defining B as 

Bx = vecBX, X=vec~ 1 x. (B4) 

Conversely, any linear operator on 4-vectors B can be asso¬ 
ciated with a linear operator on 2 x 2 matrices by defining 

BX=vec~ 1 Bx, x = vecX. (B5) 

Now, the set Lin(C 4 ,C 4 ) is simply the set of all 4 x 4 ma¬ 
trix multipliers. Equations IB5l and lB5l establish a one-to-one 
mapping between this set and the set of linear operators on 
2x2 matrices. In other words, the “vec” operator induces 
two isomorphisms: one between C 4 and C 2x2 , and the other 
between C 4x4 and linear operators on C 2x2 . We will desig¬ 
nate the second isomorphism by the symbol W: 

W B = B: Bx = vec (£>vec -1 x) , . 

W“ 1 B = B: BX = vec -1 (B vec X) ^ 

Note that W also preserves linear structure. 


8 Note that lHamakerl (2000h employs a similar formalism, but 
uses the (non-canonical) stacked rows definition instead. 
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Of particular interest to us are two linear operators on 
<C 2x2 : right-multiply by some 2x2 complex matrix A, and 
left-multiply by A: 

KaX = XA, C A X = AX (B7) 

The W isomorphism ensures that these operators can be 
represented as multiplication of 4-vectors by specific kinds 
of 4 x 4 matrices. The matrix outer product proves to be 
useful here, and in particular the following basic relation: 

vec (ABC) = ( C T ® A) vec B, (B8) 


from which we can derive the matrix operator forms via: 


vec (IX A) = ( A t 0 I) vec X 
vec(AXI) = (I® A) vecX, 


(B9) 


which gives us 


and 


W Ua = 


W C A = 


All 

0 

a 21 

0 

0 

an 

0 

a2i 

Ul2 

0 

022 

0 

_ 0 

ai2 

0 

a22 

an 

ai2 

0 

0 

a 21 

022 

0 

0 

0 

0 

an 

ai2 

0 

0 

021 

022 


(BIO) 


(BH) 


From this we get the important property that 

= W77 a h , [W£a] H = W C a h . (B12) 

Note that Eq. 14.51 which we earlier derived from the 
operator definitions, can now be verified with the 4x4 forms. 
Note also that Eq. S3] is equally valid whether interpreted 
in terms of chaining operators, or multipying the equivalent 
4x4 matrices. 


B1 Derivative operators and Jacobians 

Consider a matrix-valued function of a a matrix argument 
and its Hermitian transpose, F(G,G H ). Yet again, we can 
employ the “vec” operator to construct a one-to-one mapping 
between such functions and 4-vector valued functions of 4- 
vectors: 

f(g,g ) = vecF(vec -1 g,(vec~ 1 g) T ). (B13) 

Consider now the partial and conjugate partial Jaco¬ 
bians of / with respect to g and g , defined as per the for¬ 
malism of Sect. [I] These are 4x4 matrices, as given by 
Eg. 1231 

Jk = [dfi/dgj], J k * = [dfi/dgf, (B14) 

that represent local linear approximations to /, i.e. linear 
operators on C 4 that map increments in the arguments A g 
and A g to increments in the function value A/. The W 
isomorphism defined above matches these operators to linear 
operators on C 2x2 that represent linear approximations to 
F. It is the latter operators that we shall call the Wirtinger 
matrix derivatives of F with respect to G and G H : 

fS =w " 1(Jfc) ’ ^ =w ' 1(j ^* ) - (m5) 

This is more than just a formal definition: thanks to 


the W isomorphism, the operators given by Eq. IB15I are 
Wirtinger derivatives in exactly the same sense that the Ja¬ 
cobians of Eq. IB14I are Wirtinger derivatives, with the for¬ 
mer being simply the C 2x2 manifestation of the gradient 
operators in C 4 , as defined in Sect.|T] However, operating in 
C 2x2 space allows us to write the larger Jacobians of Sect.[l] 
in terms of simpler matrices composed of operators, result¬ 
ing in a Jacobian structure that is entirely analogous to the 
scalar derivation. 


APPENDIX C: GRADIENT-BASED 
OPTIMIZATION ALGORITHMS 

This appendix documents the various standard least-squares 
optimization algorithms that are referenced in this paper: 


Cl Algorithm SD (steepest descent) 

(i) Start with a best guess for the parameter vector, zo; 

(ii) At each step k, compute the residuals r^, and the 
Jacobian J = J(zk)', 

(iii) Compute the parameter update as (note that due to 
redundancy, only the top half of the vector actually needs 
to be computed): 

5z k = -A J H r k , (Cl) 

where A is some small value; 

(iv) If not converge^, set Zk+i = z k + Sz, and go back 
to step (ii). 


C2 Algorithm GN (Gauss-Newton) 

(i) Start with a best guess for the parameter vector, zo; 

(ii) At each step k, compute the residuals r k , and the 
Jacobian J = J(z k ); 

(iii) Compute the parameter update Sz using Eq. ll.9l with 
A = 0 (note that only the top half of the vector actually 
needs to be computed); 

(iv) If not converged, set Zk+i = z k + Sz, and go back to 
step (ii). 


C3 Algorithm LM (Levenberg-Marquardt) 

Several variations of this exist, but a typical one is: 

(i) Start with a best guess for the parameter vector, zo, 
and an initial value for the damping parameter, e.g. A = 1; 

(ii) At each step fc, compute the residuals r k , and the cost 
function = II^T-IIf. 

(iii) If xl ^ Xk-i (unsuccessful step), reset z k = z k -i, 
and set A = XK (where typically K = 10); 

(iv) Otherwise (successful step) set A = A/A'; 

(v) Compute the Jacobian J = J(z k )', 

(vi) Compute the parameter update Szk using Eq. 11.91 
(note that only the top half of the vector actually needs to 
be computed); 

(vii) If not converged, set Zk+i = Zk + Sz, and go back 
to step (ii). 


see below 
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C4 Convergence 

All of the above algortihms iterate to “convergence”. One 
or more of the following convergence criteria may be imple¬ 
mented in each case: 

• Parameter update smaller than some pre-defined 
threshold: ||<5z||f < <5o- 

• Improvement to cost function smaller than some pre¬ 
defined threshold: Xk-i ~ Xk < £ o- 

• Norm of the gradient smaller than some threshold: 
||J||f < 7o- 



